Hello, everyone, welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.4, Introduction to CUDA, and we're going to be talking about data parallelism and threads.

The objective of this lecture is to help you learn about data parallelism and the basic features of CUDA C, which is a heterogeneous parallel programming interface that enables the exploitation of data parallelism using both CPUs and GPUs. The topics that we're going to cover today are the hierarchical thread organization, the main interfaces for launching parallel execution, and the thread index to data index mapping.

The phenomenon of data parallelism is that different parts of the data can be processed independently of each other. A very simple example is vector addition. When we add two vectors together, the elements can be added independently: A[0] and B[0] can be added to form C[0], and A[1] and B[1] can be added to form C[1], independently of each other. So, if we have a large number of elements in each vector and a large amount of hardware, we should be able to perform all these additions in parallel. That is how we can fundamentally achieve high performance in a CUDA program, and that's why we're going to use this very simple example to illustrate the basic concepts of CUDA.

The parallel execution model of CUDA, and of its close relative OpenCL, is based on a host plus device arrangement. The basic concept is that when we start executing an application, the application will be executing on the host. The host is typically a CPU core. When execution reaches a parallel part of the application, that's when we have an opportunity to use a throughput-oriented device. This is done by writing specialized functions called kernel functions. These kernel functions are very similar to functions in the C programming language; they also take parameters or arguments. That's why I showed that a kernel function called KernelA will take arguments just like C functions.
However, kernel functions also take configuration parameters, and these configuration parameters are specially noted by three less-than signs in front and three greater-than signs in back. In between, we give the configuration parameters: the number of thread blocks in the grid and the number of threads in a thread block. In the example on the right-hand side, we show that the kernel will be executed by a number of thread blocks, shown as rectangular blocks, each with several threads in them. Within each thread block, we will have a number of threads, all executing in parallel.

After the execution of the kernel function, we return to the sequential part of the application, so we return to the host for sequential execution. Then, when we reach another part of the application where we have an opportunity for parallelism, we write another kernel, KernelB, that can be executed in parallel. So execution goes back and forth between the host and the device; "device" in CUDA means the parallel execution device, and most of the time the device corresponds to a throughput-oriented GPU.

This picture shows the levels of abstraction in a computer system. Typically, an application solves a problem at the human level, so we have a natural language description of the problem that the application should solve. Based on that description, we define algorithms, which have well-defined steps of computation and a well-defined criterion for terminating the computation. The algorithms are then implemented in programming languages such as C, C++, and so on. CUDA is really a programming language at the C level, and it's actually designed as an extension to the C language. More recently, more and more of the C++ features have also become available in CUDA, so CUDA is becoming more and more of a C++ programming language extension.
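To make the launch configuration concrete, here is a minimal sketch of the host and device flow just described. The kernel name KernelA comes from the slide, but the argument list, the array size, and the launch dimensions below are illustrative assumptions, not part of the lecture.

    // Minimal sketch only: KernelA's arguments and the launch sizes are assumptions.
    #include <cuda_runtime.h>

    __global__ void KernelA(float *data, int n) {
        // device code; one instance of this function runs for every thread in the grid
    }

    int main(void) {
        int n = 1024;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));   // device memory for the kernel to work on

        // ... sequential host code ...

        // Configuration parameters between <<< and >>>:
        //   first value  = number of thread blocks in the grid
        //   second value = number of threads in each thread block
        KernelA<<<4, 256>>>(d_data, n);                    // 4 blocks of 256 threads each

        cudaDeviceSynchronize();                           // wait for the device before continuing on the host

        // ... more sequential host code, possibly followed by another kernel such as KernelB ...
        cudaFree(d_data);
        return 0;
    }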
However, in order to fully understand the behavior of CUDA programs, we often have to go a little bit lower in the abstraction layers and look at the instruction set architecture and the microarchitecture, which is the organization of the hardware that executes programs. So we're going to go down a little bit in today's lecture, just so that you can have a solid understanding of the execution model of CUDA. This is based on a picture from Patt and Patel, Introduction to Computing Systems: From Bits and Gates to C and Beyond.

Let's go a little bit into the concept of ISA, or Instruction Set Architecture. The Instruction Set Architecture is a contract between hardware and software. Mostly, the Instruction Set Architecture specifies a set of instructions that the hardware can execute; as long as the software consists of these instructions, the hardware knows what to do for the software. Whenever we write a piece of code in CUDA, it will eventually be compiled down to the Instruction Set Architecture level, and the program is said to be at the instruction set level once it has been compiled down to these instructions. At this level, a program is really a set of instructions stored in memory that can be read, interpreted, and executed by the hardware. The program instructions then operate on data that are stored in memory or provided by input/output devices.

The next slide shows a very simplified diagram of how the hardware is typically organized to execute programs represented at the Instruction Set Architecture level. This diagram is based on the von Neumann processor model, which was proposed by John von Neumann in the 1940s. Today, virtually all processor cores are designed based on this model or variations of it.

Let's start at the bottom. The bottom shows the Control Unit, which contains the program counter and the instruction register. The program counter specifies the location in memory where the hardware can find the next instruction that should be executed for the application.
There is a dashed line going from the Control Unit to the memory; this is the communication path the Control Unit uses to deliver the PC value to the memory and ask the memory to return the instruction bits. Once the instruction bits return from the memory, they are placed into the instruction register, or IR. That's where the hardware examines the instruction bits and determines all the activities that need to happen in order to execute that instruction. These activities are coordinated by the control signals, which are represented by the dashed line going from the Control Unit to the Processing Unit in the middle of the picture. The control signals define the activities that the ALU, the register file, and the other components in the Processing Unit need to take in every clock cycle in order to execute the instruction.

During execution, some instructions will need to access data, that is, read or write data from or to memory. That is indicated by the upward and downward arrows between the Processing Unit and the memory. So, depending on the type of instruction, the execution will involve activity in the ALUs, the register files, memory accesses, and so on. Finally, some of the data will be moved back and forth between the memory and the I/O. The I/O represents the network, the disks, the displays, and so on, and data will be moved back and forth between memory and I/O. We're actually going to see a little bit of I/O activity as well during the rest of the class.

So now we are ready to talk about the specifics of a CUDA thread. A CUDA thread is really a virtualized, or abstracted, von Neumann processor. You can think of every CUDA thread as one of these processors, and each of these processors will be able to execute a program; the kernel function that we described is that program. The hardware effectively provides a large number of these von Neumann processors.
Each of these processors will be executing that function, the kernel function. But they are virtualized in the sense that, if you look at the hardware, the number of real processors may be much, much smaller than the number of threads that a CUDA program will create. So many of these threads will need to be executed by the real processors in turn: some of them will be actively executing and some of them will not, and this is what we call context switching. We'll elaborate on that in one of the future lectures.

Let's now look at the way a CUDA programmer thinks about threads. Whenever a CUDA kernel is executed, it is executed by a grid, or array, of threads. Here we show a one-dimensional thread block, and let's assume for the moment that the grid has only one thread block. All the threads run the same code, as we described before, but every thread has a different thread index value that it will use to compute memory addresses and make control decisions.

In this particular example, we show that there are 256 threads in the thread block, and each of them has a unique thread index from 0 to 255. There's a piece of code in the kernel, shown in the box underneath, that first calculates an i variable based on the thread index. This i variable is private to every thread; that is, every von Neumann processor that corresponds to one of those threads will have its own i variable. Thread 0 will have its own i, thread 1 will have its own i, and so on.

Thread 0 will calculate its i value as 0, because the threadIdx.x value for thread 0 is 0. Thread 2 will have an i value of 2, and so on, and thread 255 will have an i value of 255. So, when we execute the statement C[i] = A[i] + B[i], the i value for every thread will be different. Thread 0 will be adding A[0] plus B[0] and assigning that to C[0],
and thread 255 will be adding A[255] plus B[255] and assigning that to C[255].

Now that we understand how a single thread block works, we can expand to multiple thread blocks. Here we show a grid of threads that are organized into N thread blocks, and each thread block still consists of 256 threads. Now every thread has not only a thread index but also a block index. The block index variable is called blockIdx.x and the thread index variable is called threadIdx.x. These are predefined CUDA variables that we can use in a kernel, and they are initialized by the hardware for each thread; you don't need to initialize them, because the system does it for every thread.

When we form the data index, we now need to factor in both the thread index and the block index. To calculate i, we take the block index, multiply it by the block dimension (in this case, 256), and then add the thread index. Thread 0 in block 0 still has an i value of 0, because the block index value is 0 in this case. But if we look at thread block 1, all its threads see a blockIdx.x value of 1, so blockIdx.x times the block dimension is 256, and thread 0 in block 1 will have an i value of 256 instead of 0. Then, obviously, thread 0 in the next thread block, thread block 2, will have an i value of 512.

As a result, the i values of the first thread block range from 0 to 255, the i values of thread block 1 range from 256 to 511, and the i values of the next block range from 512 to 767. As you can see, all the threads together form a uniform coverage of the array elements: 0 to 255 in the first thread block, 256 to 511 in the next thread block, 512 to 767 in the block after that, and so on. This kind of coverage is what we call a linear coverage of a one-dimensional array. By using this formula, we can make sure that every element of A, B, and C is covered by one of the threads. In reality, we may have each thread cover more than one element, and we will come back to this point later.
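To tie the thread index to data index mapping together, here is a minimal sketch of a vector-addition kernel of the kind the lecture describes. The function name vecAddKernel, the bounds check, and the host-side launch below are my own illustrative assumptions; the index formula and the C[i] = A[i] + B[i] statement follow the slides.

    // Sketch of the kernel discussed above; the name and the bounds check are assumptions.
    __global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // block index * block dimension + thread index
        if (i < n) {                                     // guard in case n is not a multiple of the block size
            C[i] = A[i] + B[i];
        }
    }

    // A possible launch for n elements with 256 threads per block:
    //   int threadsPerBlock = 256;
    //   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    //   vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);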
At this point, it's sufficient to understand how we can map every thread to a unique array element.

The threads within a thread block can cooperate through shared memory, atomic operations, and barrier synchronization. For now, it's sufficient to be familiar with these three terms, because we are going to go into much more detail later. Essentially, shared memory allows the threads to exchange data, atomic operations allow the threads to coordinate their updates to the same variables, and barrier synchronization allows threads to force the others to wait for them. All of these mechanisms allow coordination of activities across different threads. However, threads in different blocks do not interact: threads 0 through 255 in thread block 0 cannot interact with threads 0 through 255 in thread block 1. This is going to be important for understanding scalability in your CUDA code.

Now, the thread index and block index are not just one-dimensional indices. In CUDA, each block index can be a 1D, 2D, or 3D variable, and each thread index can also be 1D, 2D, or 3D. That's why, on the previous slide, when we talked about the block index, I actually wrote blockIdx.x: we were only using the first dimension of the block index variable. In reality, many applications operate on two-dimensional data, such as images, or three-dimensional data, such as the volumes in a differential equation solver for computational fluid dynamics. That's why it's very convenient to be able to use 1D, 2D, or 3D block indices and thread indices: we can map them directly onto two-dimensional or three-dimensional data and keep the program easy to read.
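As a hedged illustration of the multidimensional case, here is a small sketch of a kernel that works on a two-dimensional image using a 2D grid of 2D thread blocks. The kernel name, the image layout, and the launch shape are my own assumptions and are not taken from the lecture slides.

    // Illustrative sketch only: names and sizes are assumptions.
    __global__ void scaleImageKernel(float *img, int width, int height, float s) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // x dimension maps to columns
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // y dimension maps to rows
        if (col < width && row < height) {                 // skip threads that fall outside the image
            img[row * width + col] *= s;                   // each thread handles one pixel
        }
    }

    // A possible launch with 16 x 16 = 256 threads per block:
    //   dim3 block(16, 16);
    //   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    //   scaleImageKernel<<<grid, block>>>(d_img, width, height, 2.0f);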
So here I'm showing a two-dimensional block structure and, within each block, a three-dimensional thread structure. When we look at the grid, we see that each block has two indices, the X and Y indices; in the block labels on the slide, the Y index is written first and the X index second. At the top, we have block (0, 0) and block (0, 1), so Y is equal to 0 for both blocks and X is equal to 0 and 1. The second row has a Y value of 1, and X again varies from 0 to 1. Obviously, X and Y could vary from 0 to a very large number, usually in the tens or even in the hundreds; each grid dimension in CUDA can grow up to 2 to the 16th.

Now, when we look at the threads in a block, I'm showing a three-dimensional thread organization within the block. I'm expanding block (1, 1) here, and in this toy example there are 16 threads, each with a unique three-dimensional ID. The X ID varies from 0 to 3, the Y ID ranges from 0 to 1, and the Z ID ranges from 0 to 1, which gives us 16 possibilities.

As you can see, we can combine a two-dimensional grid organization with a three-dimensional block organization, or a three-dimensional grid with a two-dimensional block, and so on. This all depends on the needs of your application. As we mentioned, this kind of multidimensional index is very convenient when we need to do image processing, or solve three-dimensional partial differential equations, and so on.

So this concludes the first episode of the introduction to CUDA. If you'd like to learn more about the basic concepts of data parallelism and the basic execution model of CUDA, I would encourage you to read Chapter Three of the textbook. Thank you.